We study the learning dynamics of self-predictive learning for reinforcement learning, a family of algorithms that learn representations by minimizing the prediction error of their own future latent representations. Despite their recent empirical success, such algorithms have an apparent defect: trivial representations (such as constants) minimize the prediction error, yet converging to such solutions is clearly undesirable. Our central insight is that careful design of the optimization dynamics is critical to learning meaningful representations. We identify two ingredients that are crucial to preventing representation collapse: faster-paced optimization of the predictor, and semi-gradient updates on the representation. Then, in an idealized setup, we show that the self-predictive learning dynamics carry out a spectral decomposition of the state transition matrix, effectively capturing information about the transition dynamics. Building on these theoretical insights, we propose bidirectional self-predictive learning, a novel self-predictive algorithm that learns two representations simultaneously. We examine the robustness of our theoretical insights with a number of small-scale experiments and showcase the promise of the novel representation learning algorithm with large-scale experiments.
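As a rough illustration of the two ingredients named in this abstract, the sketch below is a toy written for this listing, not the authors' implementation. It assumes a one-dimensional representation `phi` (one scalar per state), a small symmetric transition matrix `T`, uniform state weighting, and a scalar predictor `p` that is re-solved in closed form at every step as an idealized stand-in for "faster-paced" predictor optimization. All function and variable names are hypothetical.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (sequence of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def norm(v):
    return math.sqrt(dot(v, v))


def self_predictive_step(T, phi, lr):
    """One semi-gradient step on the loss ||p * T @ phi - stop_grad(phi)||^2."""
    t_phi = matvec(T, phi)
    # Idealized "fast" predictor: the least-squares optimal scalar p for the current phi.
    p = dot(t_phi, phi) / dot(t_phi, t_phi)
    residual = [p * a - b for a, b in zip(t_phi, phi)]
    # Semi-gradient: the target phi is treated as a constant, so the gradient
    # flows only through the prediction p * T @ phi (the factor 2 is folded into lr).
    T_transposed = list(zip(*T))
    grad = [p * g for g in matvec(T_transposed, residual)]
    return [x - lr * g for x, g in zip(phi, grad)]


# Hypothetical 3-state symmetric transition matrix (rows sum to 1).
T = [
    [0.5, 0.5, 0.0],
    [0.5, 0.0, 0.5],
    [0.0, 0.5, 0.5],
]

phi_init = [1.0, 0.2, -0.5]
phi_final = list(phi_init)
for _ in range(200):
    phi_final = self_predictive_step(T, phi_final, lr=0.1)
```

In this toy setting, when `p` is exactly optimal the semi-gradient is orthogonal to `phi`, so each step can only keep or increase the norm of the representation. That is the sense in which the sketch does not collapse to zero; it makes no claim about the richer matrix-valued setting analyzed in the paper.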
translated by 谷歌翻译
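We study multi-step off-policy learning approaches to distributional RL. Despite the apparent similarity between value-based RL and distributional RL, our study reveals intriguing and fundamental differences between the two cases in the multi-step setting. We identify a novel notion of path-dependent distributional TD error, which is indispensable for principled multi-step distributional RL. The distinction from the value-based case has important implications for concepts such as backward-view algorithms. Our work provides the first theoretical guarantees for multi-step off-policy distributional RL algorithms, including results that apply to existing approaches to multi-step distributional RL. In addition, we derive a novel algorithm, Quantile Regression-Retrace, which leads to a deep RL agent, QR-DQN-Retrace, that shows empirical improvements over QR-DQN on the Atari-57 benchmark. Collectively, we shed light on how the unique challenges of multi-step distributional RL can be addressed in both theory and practice.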
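We present BYOL-Explore, a conceptually simple yet general approach to curiosity-driven exploration in visually complex environments. BYOL-Explore learns a world representation, the world dynamics, and an exploration policy all together by optimizing a single prediction loss in the latent space, with no additional auxiliary objective. We show that BYOL-Explore is effective in DM-HARD-8, a challenging partially observable, continuous-action, hard-exploration benchmark with visually rich 3-D environments. On this benchmark, we solve the majority of the tasks purely by augmenting the extrinsic reward with BYOL-Explore's intrinsic reward, whereas prior work could only get off the ground with human demonstrations. As further evidence of the generality of BYOL-Explore, we show that it achieves superhuman performance on the ten hardest exploration games in Atari while having a much simpler design than other competitive agents.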
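We propose the Bayes-UCBVI algorithm for reinforcement learning in tabular, stage-dependent, episodic Markov decision processes: a natural extension of the Bayes-UCB algorithm of Kaufmann et al. (2012) for multi-armed bandits. Our method uses a quantile of the Q-value function posterior as an upper confidence bound on the optimal Q-value function. For Bayes-UCBVI, we prove a regret bound of order $\widetilde{O}(\sqrt{H^3SAT})$, where $H$ is the length of one episode, $S$ is the number of states, $A$ is the number of actions, and $T$ is the number of episodes. This matches the lower bound of $\Omega(\sqrt{H^3SAT})$ up to poly-$\log$ terms in $H, S, A, T$ for sufficiently large $T$. To the best of our knowledge, this is the first algorithm that obtains an optimal dependence on the horizon $H$ (and $S$) without the need for an involved Bernstein-like bonus or noise. Crucial to our analysis is a new fine-grained anti-concentration bound for a weighted Dirichlet sum, which may be of independent interest. We then explain how Bayes-UCBVI can be easily extended beyond the tabular setting, exhibiting a strong link between our algorithm and the Bayesian bootstrap (Rubin, 1981).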
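To make the "posterior quantile as upper confidence bound" idea concrete, here is a minimal sketch of the simpler multi-armed bandit setting that this abstract says the method extends. It is not Bayes-UCBVI itself and not the authors' code. It assumes Bernoulli rewards with a Beta(1, 1) prior, approximates the posterior quantile by Monte Carlo sampling rather than an exact quantile function, and uses the quantile level `1 - 1/(t + 1)` purely as an illustrative choice. All names are hypothetical.

```python
import random


def empirical_quantile(samples, level):
    """Return the empirical quantile of `samples` at `level` in [0, 1]."""
    ordered = sorted(samples)
    index = min(len(ordered) - 1, int(level * len(ordered)))
    return ordered[index]


def bayes_ucb_index(successes, failures, level, rng, num_samples=200):
    """Monte Carlo estimate of a quantile of the Beta posterior for one arm."""
    draws = [rng.betavariate(1 + successes, 1 + failures) for _ in range(num_samples)]
    return empirical_quantile(draws, level)


def run_bandit(true_means, horizon, seed=0):
    """Play a Bernoulli bandit, always pulling the arm with the largest quantile index."""
    rng = random.Random(seed)
    successes = [0] * len(true_means)
    failures = [0] * len(true_means)
    for t in range(1, horizon + 1):
        level = 1.0 - 1.0 / (t + 1)  # illustrative schedule, an assumption of this sketch
        indices = [
            bayes_ucb_index(s, f, level, rng) for s, f in zip(successes, failures)
        ]
        arm = max(range(len(true_means)), key=indices.__getitem__)
        if rng.random() < true_means[arm]:
            successes[arm] += 1
        else:
            failures[arm] += 1
    return successes, failures


successes, failures = run_bandit([0.2, 0.8], horizon=200)
pulls = [s + f for s, f in zip(successes, failures)]
```

The quantile plays the role that an explicit bonus term plays in classical UCB: arms with few pulls have wide posteriors and therefore high upper quantiles, which encourages exploring them.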
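Despite the empirical success of meta reinforcement learning (meta-RL), there remain poorly understood discrepancies between theory and practice. Critically, biased gradient estimates are almost always implemented in practice, whereas prior theory on meta-RL only establishes convergence under unbiased gradient estimates. In this work, we investigate this discrepancy. In particular, (1) we show that unbiased gradient estimates have variance $\Theta(N)$, which depends linearly on the sample size $N$ of the inner-loop updates; (2) we propose linearized score function (LSF) gradient estimates, which have bias $\mathcal{O}(1/\sqrt{N})$ and variance $\mathcal{O}(1/N)$; (3) we show that most empirical prior work in fact implements variants of the LSF gradient estimates, which implies that practical algorithms "accidentally" introduce bias to achieve better performance; (4) we establish theoretical guarantees for the convergence of LSF gradient estimates to stationary points, showing a better dependence on $N$ than prior work when $N$ is large.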
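Model-agnostic meta-reinforcement learning requires estimating the Hessian matrix of value functions. This is challenging from an implementation perspective, because repeatedly differentiating policy gradient estimates may lead to biased Hessian estimates. In this work, we provide a unifying framework for estimating higher-order derivatives of value functions, based on off-policy evaluation. Our framework interprets a number of existing approaches as special cases and elucidates the bias-variance trade-off of Hessian estimates. The framework also opens the door to a new family of estimates, which can be easily implemented with automatic differentiation libraries and lead to performance gains in practice.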
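We consider the problem of efficient black-box optimization over a large hybrid search space, consisting of a mixture of a high-dimensional continuous space and a complex combinatorial space. Such examples arise commonly in evolutionary computation and, more recently, in neuroevolution and architecture search for reinforcement learning (RL) policies. Unfortunately, previous mutation-based approaches suffer in high-dimensional continuous spaces, both theoretically and practically. We therefore propose ES-ENAS, a simple joint optimization procedure that combines Evolutionary Strategies (ES) with combinatorial optimization techniques in a highly scalable and intuitive way, inspired by the one-shot or supernet paradigm introduced in Efficient Neural Architecture Search (ENAS). Through this relatively simple marriage of two different lines of research, we are able to gain the best of both worlds, and we empirically validate our approach by optimizing BBOB functions over hybrid spaces, as well as combinatorial neural network architectures via edge pruning and quantization on popular RL benchmarks. Owing to the modularity of the algorithm, we are also able to incorporate a wide variety of popular techniques, ranging from different continuous and combinatorial optimizers to constrained optimization.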
Micro-expressions (MEs), one of the most important psychological stress reactions, are spontaneous and transient facial expressions that can reveal the genuine emotions of human beings. Automatic micro-expression recognition (MER) is therefore becoming increasingly crucial in the field of affective computing, and provides essential technical support for lie detection, psychological analysis, and other areas. However, the lack of abundant ME data seriously restricts the development of cutting-edge data-driven MER models. Although several spontaneous ME datasets have recently been released to alleviate this problem, the amount of available data remains small. To address this shortage of ME data, we construct a dynamic spontaneous ME dataset with the largest current ME data scale, called DFME (Dynamic Facial Micro-expressions), which includes 7,526 well-labeled ME videos elicited from 671 participants and annotated by more than 20 annotators over three years. We then apply four classical spatiotemporal feature learning models to DFME in MER experiments to objectively verify the validity of the dataset. In addition, we explore different solutions to the class imbalance and key-frame sequence sampling problems in dynamic MER on DFME, so as to provide a valuable reference for future research. The comprehensive experimental results show that DFME can facilitate research on automatic MER and provide a new benchmark for MER. DFME will be published via https://mea-lab-421.github.io.
Reading comprehension of legal text can be a particularly challenging task due to the length and complexity of legal clauses and a shortage of expert-annotated datasets. To address this challenge, we introduce the Merger Agreement Understanding Dataset (MAUD), an expert-annotated reading comprehension dataset based on the American Bar Association's 2021 Public Target Deal Points Study, with over 39,000 examples and over 47,000 total annotations. Our fine-tuned Transformer baselines show promising results, with models performing well above random on most questions. However, on a large subset of questions, there is still room for significant improvement. As the only expert-annotated merger agreement dataset, MAUD is valuable as a benchmark for both the legal profession and the NLP community.
An increasing number of public datasets have shown a marked clinical impact on assessing anatomical structures. However, each of these datasets is small, partially labeled, and rarely includes subjects with severe tumors. Moreover, current models are limited to segmenting specific organs or tumors and cannot be extended to novel domains and classes. To tackle these limitations, we introduce embeddings learned from Contrastive Language-Image Pre-training (CLIP) into segmentation models, an approach we dub the CLIP-Driven Universal Model. The Universal Model can better segment 25 organs and 6 types of tumors by exploiting the semantic relationship between abdominal structures. The model is developed from an assembly of 14 datasets with 3,410 CT scans and evaluated on 6,162 external CT scans from 3 datasets. We rank first on the public leaderboard of the Medical Segmentation Decathlon (MSD) and achieve state-of-the-art results on Beyond The Cranial Vault (BTCV). Compared with dataset-specific models, the Universal Model is computationally more efficient (6x faster), generalizes better to CT scans from varying sites, and shows stronger transfer learning performance on novel tasks. The design of the CLIP embedding enables the Universal Model to be easily extended to new classes without catastrophically forgetting previously learned classes.